A Database of Glyphs for OCR of Mathematical Documents

Authors

  • Alan P. Sexton
  • Volker Sorge
Abstract

Automatic document analysis tools for mathematical texts are necessary to enlarge the pool of mathematical knowledge available in electronic form. However, development of such tools is currently hindered by the weakness of optical character recognition systems in dealing with the large range of mathematical symbols and the often subtle but important distinctions in font usage in mathematical texts. Research on developing better systems for mathematical optical character recognition crucially depends on having an extensive, high-quality database of glyphs used in mathematical texts for training and test purposes. We present such a database of symbols, constructed from a large set of characters available in the LaTeX document preparation system, that can serve as a basis for mathematical text recognition. We describe its integration into a prototypical optical character recognition system for mathematics that enables the construction of LaTeX source documents from mathematical documents available as images. From the lessons learned in this work we derive a road map for further research into the area of mathematical text analysis.

Similar resources

Extraction of Logical Structure from Articles in Mathematics

We propose a mathematical knowledge browser which helps people to read mathematical documents. With the browser, printed mathematical documents can be scanned and recognized by OCR (Optical Character Recognition). Then the meta-information (e.g. title, author) and the logical structure (e.g. section, theorem) of the documents are automatically extracted. The purpose of this paper is to show the ex...

Vectorisation of Glyphs and Their Representation in SVG for XML-based Processing

This paper presents an approach for converting bitmap images of text glyphs into a vector format that is suitable for embedding in XML representations of digitized documents. The focus is on a contour-based vectorisation method, as the output can be easily transformed into SVG glyph descriptions. A concrete implementation is described and the results are discussed with special regard to the v...

Font group identification using reconstructed fonts

Ideally, digital versions of scanned documents should be represented in a format that is searchable, compressed, highly readable, and faithful to the original. These goals can theoretically be achieved through OCR and font recognition, re-typesetting the document text with original fonts. However, OCR and font recognition remain hard problems, and many historical documents use fonts that are no...

Perceptual Organization in Semantic Role Labeling

Documents are produced for the purpose of human interpretation. Human perceptual factors have played an important role in the design of documents, from the development of glyphs and scripts to the layout of visual components. OCR technology allows recovery of textual content from images of text but does not recover visual information encoded in layout. We explore the role of perceptual organiz...

Publication date: 2005